Disk-Based Sampling for Outlier Detection in High Dimensional Data
Authors
Abstract
We propose an efficient sampling-based outlier detection method for large, high-dimensional data sets. The method has two phases. In the first phase, we combine a “sampling” strategy with a simple randomized partitioning technique to generate a candidate set of outliers. This phase requires one full data scan, and its running time is linear in both the size and the dimensionality of the data set. The second phase is one additional data scan that extracts the actual outliers from the candidate set; its running time is O(CN), where C and N are the sizes of the candidate set and the data set, respectively. The approach has two major strengths: (1) it requires no partitioning of the dimensions, which makes it particularly suitable for high-dimensional data, and (2) a small sample (0.5% of the original data set) can discover more than 99% of the outliers identified by a full brute-force approach. We present a detailed experimental evaluation of the method on real and synthetic data sets and compare it with another sampling approach.
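The two-phase structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not define the outlier score or the randomized partitioning step, so the sketch assumes a k-nearest-neighbor distance score, omits partitioning entirely, and holds the data in memory rather than on disk. The function names `candidate_phase` and `verify_phase` and all parameters are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points of equal dimensionality."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def candidate_phase(data, sample_frac, n_candidates, rng):
    """Phase 1 (one pass over the data): pick candidate outliers.

    Assumption: a point is scored by its distance to the nearest point
    of a small random sample. For a fixed sample size the cost is linear
    in the number of points and in the dimensionality.
    """
    sample_size = max(2, int(len(data) * sample_frac))
    sample_idx = rng.sample(range(len(data)), sample_size)
    scores = []
    for i, p in enumerate(data):
        # Ignore the point itself if it happens to be in the sample.
        nearest = min(dist(p, data[j]) for j in sample_idx if j != i)
        scores.append((nearest, i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:n_candidates]]


def verify_phase(data, candidates, k, n_outliers):
    """Phase 2 (one more pass): exact scores for the candidates only.

    Each of the C candidates is compared with all N points, which is
    the O(CN) cost mentioned in the abstract. Assumption: the exact
    score is the distance to the k-th nearest neighbor.
    """
    exact = []
    for i in candidates:
        dists = sorted(dist(data[i], q) for j, q in enumerate(data) if j != i)
        exact.append((dists[k - 1], i))
    exact.sort(reverse=True)
    return [i for _, i in exact[:n_outliers]]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data: one dense 5-dimensional cluster plus three far points.
    points = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
    points += [[50.0] * 5, [-50.0] * 5, [50.0, -50.0, 50.0, -50.0, 50.0]]
    cands = candidate_phase(points, sample_frac=0.05, n_candidates=10, rng=rng)
    print(sorted(verify_phase(points, cands, k=5, n_outliers=3)))
```

The sketch keeps the expensive exact computation confined to the candidate set, which is the reason a two-phase design can avoid the quadratic cost of a brute-force comparison of all pairs.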
Related resources
Disk-Based Successive Sampling for Outlier Detection in High Dimensional Data
We propose a sampling-based outlier detection method for large high-dimensional data. Our method consists of two phases. In the first phase, we combine a “successive sampling” strategy with a simple randomized partitioning technique to generate a candidate set of outliers. This phase requires one full data scan and the running time has linear complexity with respect to the size and dimensionali...
Rapid Distance-Based Outlier Detection via Sampling
Distance-based approaches to outlier detection are popular in data mining, as they do not require modeling the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling...
Outlier Detection for Support Vector Machine using Minimum Covariance Determinant Estimator
The purpose of this paper is to identify the points that affect the performance of one of the important data mining algorithms, the support vector machine. The final classification decision is based on a small portion of the data called support vectors. The presence of atypical observations among these points therefore results in deviation from the correct decision. Thus...
Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data
Background and purpose: As science, knowledge, and technology evolve, we increasingly deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimensional problems, classical methods are not reliable because of a large number of predictor variable...
Outlier Detection in High Dimensional, Spatial and Sequential Data Sets
Of all the data mining techniques, outlier detection seems closest to the definition of “discovering nuggets of information” in large databases. When an outlier is detected, and determined to be genuine, it can provide insights, which can radically change our understanding of the underlying process. The purpose of the research underlying this thesis was to investigate and devise methods to mine...